Comparing common and rare single variant and gene aggregate instrumentation strategies for MR
Author
Aimee Hanson
Published
May 13, 2025
Introduction
Classically, Mendelian Randomisation (MR) methods utilising trait-associated genetic variants from GWAS have employed common polymorphisms (population MAF > 1%), either targeted by genotyping chips or reliably imputed from reference populations, to instrument modifiable exposures. However, common variants tested in GWAS typically explain a very small fraction of the variability in a measured complex trait, may exhibit pleiotropic effects (whether maintained by balancing selection or arising as a consequence of genetic linkage), and are rarely causal. Rare variants, which typically show large biological effects (e.g. through abolishing protein expression), provide a means of instrumenting relevant molecular processes less ambiguously. Comparing causal estimates derived using different methods of genetically instrumenting modifiable exposures may aid interpretation of the biological mechanisms underlying exposure-outcome relationships. This includes using variants from across the allele frequency spectrum, as well as leveraging rare variant aggregate approaches to instrument gene-level perturbations in expression and function.
Causal estimates for pairwise combinations of the exposures and outcomes listed below have been derived using twelve instrumentation strategies.
Exposures: Low density lipoprotein levels (LDL direct), Body Mass Index (BMI), Vitamin D, Triglycerides, Glycated Haemoglobin (HbA1c), Mean Platelet Volume (MPV), IGF-1, Waist-to-Hip Ratio (BMI-corrected), Red Blood Cell (RBC; erythrocyte) count and Mean Corpuscular Volume (MCV).
Outcomes: Coronary Artery Disease (CAD), Type 2 Diabetes (T2D), Multiple Sclerosis (MS), Ischemic Stroke, Atrial Fibrillation (AF), Venous Thromboembolism (VTE), Prostate Cancer and Hypertension.
Instruments
Twelve sets of instruments for each exposure have been extracted from three sources: DeepRVAT gene impairment scores; Genebass whole exome single variants and aggregate burden masks; and UKB common variant GWAS summary statistics.
Common GWAS
Associated common variants from UKB GWAS (>1% MAF)
Genebass (variants)
Common exome-wide (>5% MAF, LD clumped)
Low-frequency exome-wide (1-5% MAF, LD clumped)
Rare exome-wide (0-1% MAF, both unfiltered and filtered to the top hit per gene)
Ultra-rare exome-wide (0-0.1% MAF, both unfiltered and filtered to the top hit per gene)
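The frequency bins and the "top hit per gene" filter listed above can be sketched on a toy table. This is a minimal illustration only: the column names (`variant`, `gene`, `maf`, `pval`), the handling of bin boundaries, and the choice of the smallest p-value as the "top hit" are all assumptions, not details taken from the analysis itself.

```r
# Toy single-variant table; all names and values are hypothetical
variants <- data.frame(
  variant = c("v1", "v2", "v3", "v4", "v5", "v6"),
  gene    = c("LDLR", "LDLR", "PCSK9", "APOB", "APOB", "APOB"),
  maf     = c(0.12, 0.03, 0.004, 0.0005, 0.0002, 0.008),
  pval    = c(1e-20, 1e-9, 1e-30, 1e-8, 1e-12, 1e-6),
  stringsAsFactors = FALSE
)

# Frequency bins. The ultra-rare set is nested within the rare set,
# so logical filters are used rather than mutually exclusive classes.
# Assumption: boundaries are treated as < 1% and < 0.1% for the rare sets.
common     <- variants[variants$maf > 0.05, ]
low_freq   <- variants[variants$maf >= 0.01 & variants$maf <= 0.05, ]
rare       <- variants[variants$maf < 0.01, ]
ultra_rare <- variants[variants$maf < 0.001, ]

# Keep one variant per gene. Assumption: "top hit" = smallest p-value.
top_hit_per_gene <- function(df) {
  df <- df[order(df$pval), ]
  df[!duplicated(df$gene), ]
}

rare_top       <- top_hit_per_gene(rare)
ultra_rare_top <- top_hit_per_gene(ultra_rare)
```

In this toy example the unfiltered rare set holds four variants, while the filtered set retains one variant each for PCSK9 and APOB.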
Add gene annotations for instruments (taken from variant/mask position in exome data for ExWAS studies and nearest gene for common variant GWAS studies)
Code
# Annotate DeepRVAT and burden masks with relevant gene
# ExWAS single variants are already annotated
# Annotate common variants with nearest gene based on VEP annotation:

# List of rsIDs to extract from VEP file (using VEP online interface, returning single consequence per variant --pick)
# common_instruments <- unlist(lapply(harmonised_studies, function(x){
#   return(x$opengwas_common$SNP)
# })) |> unique()
# write.table(common_instruments, file.path(data_dir,"variant_annotation","complextrait_openGWASinstruments.txt"),
#             row.names = F, col.names = F, quote = F)

vep <- data.table::fread(file.path(data_dir, "variant_annotation", "vep_openGWASinstruments.txt")) |>
  dplyr::filter(grepl("^[0-9]", Location))

## Retain variant annotations for SNPs within 1kb of a protein coding gene only
vep_coding <- vep |>
  filter(BIOTYPE == "protein_coding") |>
  filter(!(as.numeric(DISTANCE) > 1000) | DISTANCE == "-")

for(i in 1:length(harmonised_studies)){
  for(j in 1:length(harmonised_studies[[i]])){
    if(names(harmonised_studies[[i]][j]) == "deeprvat_genescore"){
      harmonised_studies[[i]][[j]]$gene.exposure = toupper(harmonised_studies[[i]][[j]]$SNP)
    }else if(names(harmonised_studies[[i]][j]) %in% names(instrument_type[instrument_type == "mask"])){
      harmonised_studies[[i]][[j]]$gene.exposure = toupper(gsub("_.*", "", harmonised_studies[[i]][[j]]$SNP))
    }else if(names(harmonised_studies[[i]][j]) == "opengwas_common"){
      gene_symbols <- vep_coding[match(harmonised_studies[[i]][[j]]$SNP, vep_coding$`#Uploaded_variation`), c("SYMBOL", "Consequence")] |>
        as.data.frame()
      names(gene_symbols) <- c("gene.exposure", "vep.consequence")
      harmonised_studies[[i]][[j]] <- cbind(harmonised_studies[[i]][[j]], gene_symbols)
    }else{
      next
    }
  }
}
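To make the two labelling steps in the loop concrete, the sketch below applies the same `gsub()`/`toupper()` and `match()` idioms to toy inputs. The mask identifiers, rsIDs and gene symbols are invented for illustration, and the `<gene>_<mask>` identifier format is an assumption inferred from the regular expression rather than a documented format.

```r
# Hypothetical burden-mask identifiers of the form "<gene>_<mask>"
mask_ids <- c("pcsk9_pLoF", "LDLR_missense_LC", "apob_pLoF")

# Drop everything from the first underscore onwards, then upper-case
mask_genes <- toupper(gsub("_.*", "", mask_ids))

# Hypothetical VEP-style lookup table (one row per annotated variant)
vep_toy <- data.frame(
  `#Uploaded_variation` = c("rs1", "rs2"),
  SYMBOL                = c("GENEA", "GENEB"),
  Consequence           = c("intron_variant", "missense_variant"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Instruments to annotate; "rs9" has no entry in the lookup table
snps <- c("rs2", "rs9", "rs1")

# match() returns NA for variants absent from the lookup, so unannotated
# SNPs are kept as rows with NA gene and consequence
annot <- vep_toy[match(snps, vep_toy$`#Uploaded_variation`),
                 c("SYMBOL", "Consequence")]
names(annot) <- c("gene.exposure", "vep.consequence")
```

One consequence of this design is that common variants lying further than 1kb from a protein-coding gene stay in the instrument set but carry a missing `gene.exposure` value.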
IVW estimates across instrument sets (subsetting to shared genes)
Differences in causal effect estimates across instrument sets could be due to differences in the underlying biological processes captured by the included variants. The above analysis was repeated with restriction to instruments hitting a common set of genes/gene regions across strategies:
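A minimal sketch of this restriction, followed by a fixed-effect inverse-variance weighted (IVW) estimate per instrument set, is given here. It is not the pipeline used for the analysis: the `ivw()` helper, the toy effect sizes and the `beta.exposure`/`beta.outcome`/`se.outcome` column names are assumptions, and the weights are the simple first-order ones based on the outcome standard error.

```r
# Fixed-effect IVW estimate from per-instrument summary statistics
# (hypothetical helper; first-order weights 1 / se_out^2)
ivw <- function(beta_exp, beta_out, se_out) {
  w   <- 1 / se_out^2
  est <- sum(w * beta_exp * beta_out) / sum(w * beta_exp^2)
  se  <- sqrt(1 / sum(w * beta_exp^2))
  c(estimate = est, se = se)
}

# Two toy instrument sets annotated with a gene label (values are invented)
instrument_sets <- list(
  common_gwas = data.frame(
    gene.exposure = c("GENEA", "GENEB", "GENEC"),
    beta.exposure = c(0.10, 0.20, 0.05),
    beta.outcome  = c(0.05, 0.10, 0.025),
    se.outcome    = c(0.01, 0.02, 0.01),
    stringsAsFactors = FALSE
  ),
  rare_mask = data.frame(
    gene.exposure = c("GENEA", "GENEB", "GENED"),
    beta.exposure = c(0.5, 1.0, 0.8),
    beta.outcome  = c(0.25, 0.5, 0.1),
    se.outcome    = c(0.05, 0.1, 0.1),
    stringsAsFactors = FALSE
  )
)

# Genes instrumented by every set (instruments without a gene label are ignored)
shared_genes <- Reduce(intersect, lapply(instrument_sets, function(x) {
  unique(x$gene.exposure[!is.na(x$gene.exposure)])
}))

# Re-estimate within each set using only instruments in the shared genes
ivw_shared <- lapply(instrument_sets, function(x) {
  x <- x[x$gene.exposure %in% shared_genes, ]
  ivw(x$beta.exposure, x$beta.outcome, x$se.outcome)
})
```

With more than two strategies, the intersection can shrink quickly, so it may be worth checking how many instruments remain in each set before interpreting the restricted estimates.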